Design an agent to fly a quadcopter, and then train it using a reinforcement learning algorithm of your choice!
Try to apply the techniques you have learnt, but also feel free to come up with innovative ideas and test them.
Take a look at the files in the directory to better understand the structure of the project.
- `task.py`: Define your task (environment) in this file.
- `agents/`: Folder containing reinforcement learning agents.
    - `policy_search.py`: A sample agent has been provided here.
    - `agent.py`: Develop your agent here.
- `physics_sim.py`: This file contains the simulator for the quadcopter. DO NOT MODIFY THIS FILE.

For this project, you will define your own task in `task.py`. Although we have provided an example task to get you started, you are encouraged to change it. Later in this notebook, you will learn more about how to amend this file.
You will also design a reinforcement learning agent in agent.py to complete your chosen task.
You are welcome to create any additional files to help you to organize your code. For instance, you may find it useful to define a model.py file defining any needed neural network architectures.
We provide a sample agent in the code cell below to show you how to use the sim to control the quadcopter. This agent is even simpler than the sample agent that you'll examine (in agents/policy_search.py) later in this notebook!
The agent controls the quadcopter by setting the revolutions per second on each of its four rotors. The provided agent in the Basic_Agent class below always selects a random action for each of the four rotors. These four speeds are returned by the act method as a list of four floating-point numbers.
For this project, the agent that you will implement in agents/agent.py will have a far more intelligent method for selecting actions!
import random

class Basic_Agent():
    def __init__(self, task):
        self.task = task

    def act(self):
        new_thrust = random.gauss(450., 25.)
        return [new_thrust + random.gauss(0., 1.) for x in range(4)]
Run the code cell below to have the agent select actions to control the quadcopter.
Feel free to change the provided values of runtime, init_pose, init_velocities, and init_angle_velocities below to change the starting conditions of the quadcopter.
The labels list below annotates statistics that are saved while running the simulation. All of this information is saved in a text file data.txt and stored in the dictionary results.
%reload_ext autoreload
%autoreload 2
import csv
import numpy as np
from task import Task
# Modify the values below to give the quadcopter a different starting position.
runtime = 5. # time limit of the episode
init_pose = np.array([0., 0., 10., 0., 0., 0.]) # initial pose
init_velocities = np.array([0., 0., 0.]) # initial velocities
init_angle_velocities = np.array([0., 0., 0.]) # initial angle velocities
file_output = 'data.txt' # file name for saved results
# Setup
task = Task(init_pose, init_velocities, init_angle_velocities, runtime)
agent = Basic_Agent(task)
done = False
labels = ['time', 'x', 'y', 'z', 'phi', 'theta', 'psi', 'x_velocity',
          'y_velocity', 'z_velocity', 'phi_velocity', 'theta_velocity',
          'psi_velocity', 'rotor_speed1', 'rotor_speed2', 'rotor_speed3', 'rotor_speed4']
results = {x: [] for x in labels}

# Run the simulation, and save the results.
with open(file_output, 'w') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(labels)
    while True:
        rotor_speeds = agent.act()
        _, _, done = task.step(rotor_speeds)
        to_write = [task.sim.time] + list(task.sim.pose) + list(task.sim.v) \
                   + list(task.sim.angular_v) + list(rotor_speeds)
        for ii in range(len(labels)):
            results[labels[ii]].append(to_write[ii])
        writer.writerow(to_write)
        if done:
            break
Run the code cell below to visualize how the position of the quadcopter evolved during the simulation.
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(results['time'], results['x'], label='x')
plt.plot(results['time'], results['y'], label='y')
plt.plot(results['time'], results['z'], label='z')
plt.legend()
_ = plt.ylim()
The next code cell visualizes the velocity of the quadcopter.
plt.plot(results['time'], results['x_velocity'], label='x_hat')
plt.plot(results['time'], results['y_velocity'], label='y_hat')
plt.plot(results['time'], results['z_velocity'], label='z_hat')
plt.legend()
_ = plt.ylim()
Next, you can plot the Euler angles (the rotation of the quadcopter about the $x$-, $y$-, and $z$-axes),
plt.plot(results['time'], results['phi'], label='phi')
plt.plot(results['time'], results['theta'], label='theta')
plt.plot(results['time'], results['psi'], label='psi')
plt.legend()
_ = plt.ylim()
before plotting the velocities (in radians per second) corresponding to each of the Euler angles.
plt.plot(results['time'], results['phi_velocity'], label='phi_velocity')
plt.plot(results['time'], results['theta_velocity'], label='theta_velocity')
plt.plot(results['time'], results['psi_velocity'], label='psi_velocity')
plt.legend()
_ = plt.ylim()
Finally, you can use the code cell below to print the agent's choice of actions.
plt.plot(results['time'], results['rotor_speed1'], label='Rotor 1 revolutions / second')
plt.plot(results['time'], results['rotor_speed2'], label='Rotor 2 revolutions / second')
plt.plot(results['time'], results['rotor_speed3'], label='Rotor 3 revolutions / second')
plt.plot(results['time'], results['rotor_speed4'], label='Rotor 4 revolutions / second')
plt.legend()
_ = plt.ylim()
When specifying a task, you will derive the environment state from the simulator. Run the code cell below to print the values of the following variables at the end of the simulation:
- `task.sim.pose` (the position of the quadcopter in ($x,y,z$) dimensions and the Euler angles),
- `task.sim.v` (the velocity of the quadcopter in ($x,y,z$) dimensions), and
- `task.sim.angular_v` (radians/second for each of the three Euler angles).

# the pose, velocity, and angular velocity of the quadcopter at the end of the episode
print(task.sim.pose)
print(task.sim.v)
print(task.sim.angular_v)
In the sample task in task.py, we use the 6-dimensional pose of the quadcopter to construct the state of the environment at each timestep. However, when amending the task for your purposes, you are welcome to expand the size of the state vector by including the velocity information. You can use any combination of the pose, velocity, and angular velocity - feel free to tinker here, and construct the state to suit your task.
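As a minimal sketch of that idea, a state vector combining all three quantities could be built as follows (`build_state` is a hypothetical helper, not part of the provided code; it only assumes the simulator attributes printed above):

```python
import numpy as np

def build_state(sim):
    """Concatenate pose (6), velocity (3), and angular velocity (3)
    into a single 12-dimensional state vector."""
    return np.concatenate([sim.pose, sim.v, sim.angular_v])
```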
A sample task has been provided for you in task.py. Open this file in a new window now.
The __init__() method is used to initialize several variables that are needed to specify the task.
- The simulator is initialized as an instance of the `PhysicsSim` class (from `physics_sim.py`).
- We make use of action repeats: for each timestep of the agent, we step the simulation `action_repeats` timesteps. If you are not familiar with action repeats, please read the Results section in the DDPG paper.
- We set the number of elements in the state vector. For the sample task, we only work with the 6-dimensional pose information. To set the size of the state (`state_size`), we must take action repeats into account.
- The environment will always have a 4-dimensional action space, with one entry for each rotor (`action_size=4`). You can set the minimum (`action_low`) and maximum (`action_high`) values of each entry here.

The `reset()` method resets the simulator. The agent should call this method every time the episode ends. You can see an example of this in the code cell below.
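The sizing logic can be sketched as follows (a simplified stand-in for the sample `Task` with the simulator omitted; the 0–900 rotor-speed range is inferred from the discussion later in this notebook):

```python
class TaskSketch:
    """Illustrative sketch of how a task sizes its state and action
    spaces when using action repeats."""
    def __init__(self, action_repeat=3):
        self.action_repeat = action_repeat
        # Each observation is the 6-D pose, stacked action_repeat times.
        self.state_size = self.action_repeat * 6
        # One rotor speed per rotor, bounded below and above.
        self.action_low = 0
        self.action_high = 900
        self.action_size = 4
```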
The step() method is perhaps the most important. It accepts the agent's choice of action rotor_speeds, which is used to prepare the next state to pass on to the agent. Then, the reward is computed from get_reward(). The episode is considered done if the time limit has been exceeded, or the quadcopter has travelled outside of the bounds of the simulation.
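The flow of `step()` can be sketched as a standalone function (hypothetical; the real implementation lives in `task.py`, and this sketch assumes the simulator exposes a `next_timestep()` method that returns whether the episode has ended):

```python
import numpy as np

def step_sketch(task, rotor_speeds):
    """Repeat the chosen action, accumulate reward, and build the
    next state by stacking the resulting poses."""
    reward = 0
    pose_all = []
    for _ in range(task.action_repeat):
        done = task.sim.next_timestep(rotor_speeds)  # advance the simulation
        reward += task.get_reward()
        pose_all.append(task.sim.pose)
    next_state = np.concatenate(pose_all)
    return next_state, reward, done
```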
In the next section, you will learn how to test the performance of an agent on this task.
The sample agent given in agents/policy_search.py uses a very simplistic linear policy to directly compute the action vector as a dot product of the state vector and a matrix of weights. Then, it randomly perturbs the parameters by adding some Gaussian noise, to produce a different policy. Based on the average reward obtained in each episode (score), it keeps track of the best set of parameters found so far, how the score is changing, and accordingly tweaks a scaling factor to widen or tighten the noise.
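The hill-climbing scheme described above can be sketched with a few hypothetical helpers (names are illustrative, not taken from `agents/policy_search.py`; the noise-scale bounds are arbitrary example values):

```python
import numpy as np

def linear_act(state, w):
    """Action as a dot product of the state vector and a weight matrix."""
    return np.dot(state, w)

def perturb_policy(best_w, noise_scale, rng=np.random):
    """Produce a new policy: best weights plus scaled Gaussian noise."""
    return best_w + noise_scale * rng.normal(size=best_w.shape)

def update_noise(score, best_score, noise_scale):
    """Tighten the noise after an improvement, widen it otherwise."""
    if score > best_score:
        return max(0.5 * noise_scale, 0.01)
    return min(2.0 * noise_scale, 3.2)
```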
Run the code cell below to see how the agent performs on the sample task.
import sys
import numpy as np
import pandas as pd
from agents.policy_search import PolicySearch_Agent
from task import Task
%reload_ext autoreload
%autoreload 2
num_episodes = 500
task = Task(init_pose=np.array([0., 0., 10., 0., 0., 0.]),
            init_velocities=np.array([0., 0., 0.]),
            init_angle_velocities=np.array([0., 0., 0.]),
            runtime=10.,
            target_pos=np.array([0., 0., 20.]))
agent = PolicySearch_Agent(task)

for i_episode in range(1, num_episodes + 1):
    state = agent.reset_episode()  # start a new episode
    while True:
        action = agent.act(state)
        next_state, reward, done = task.step(action)
        agent.step(reward, done)
        state = next_state
        if done:
            print("\rEpisode = {:4d}, score = {:7.3f} (best = {:7.3f}), noise_scale = {}".format(
                i_episode, agent.score, agent.best_score, agent.noise_scale), end="")  # [debug]
            break
    sys.stdout.flush()
This agent should perform very poorly on this task. And that's where you come in!
Amend task.py to specify a task of your choosing. If you're unsure what kind of task to specify, you may like to teach your quadcopter to takeoff, hover in place, land softly, or reach a target pose.
After specifying your task, use the sample agent in agents/policy_search.py as a template to define your own agent in agents/agent.py. You can borrow whatever you need from the sample agent, including ideas on how you might modularize your code (using helper methods like act(), learn(), reset_episode(), etc.).
Note that it is highly unlikely that the first agent and task that you specify will learn well. You will likely have to tweak various hyperparameters and the reward function for your task until you arrive at reasonably good behavior.
As you develop your agent, it's important to keep an eye on how it's performing. Use the code above as inspiration to build in a mechanism to log/save the total rewards obtained in each episode to file. If the episode rewards are gradually increasing, this is an indication that your agent is learning.
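A minimal sketch of such a logging mechanism, assuming the per-episode total rewards are collected in a list and written out once at the end (the filename and function name are arbitrary):

```python
import csv

def log_rewards(rewards, filename='rewards_log.csv'):
    """Write one row per episode so the learning curve can be
    plotted later."""
    with open(filename, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['episode', 'total_reward'])
        for i, r in enumerate(rewards, start=1):
            writer.writerow([i, r])
```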
## TODO: Train your agent here.
import warnings; warnings.simplefilter('ignore')
from ddpg_agent.agent import DDPG, Q_a_frames_spec
from ddpg_agent.quadcopter_environment import QuadcopterState
from ddpg_agent.visualizations import plot_quadcopter_episode, plot_scores, visualize_quad_agent
%matplotlib inline
num_episodes = 500
agent = DDPG(task, ou_mu=0, ou_theta=.3, ou_sigma=1,
             discount_factor=.9, replay_buffer_size=50000, replay_batch_size=1024,
             tau_actor=.4, tau_critic=.6,
             # relu_alpha_actor=.01, relu_alpha_critic=.01,
             lr_actor=.00001, lr_critic=.0001,
             activation_fn_actor='tanh',
             do_preprocessing=False,
             # normalize_rewards=False,
             activity_l2_reg=.003,
             )

def episode_callback(episode_num):
    last_training_episode = agent.history.training_episodes[-1]
    if episode_num % 10 == 0:
        fig = plot_quadcopter_episode(last_training_episode)
        display(fig)
agent.set_episode_callback(episode_callback)

def max_training_score_callback(episode):
    last_training_episode = agent.history.training_episodes[-1]
    print("New best training score.")
    fig = plot_quadcopter_episode(last_training_episode)
    display(fig)
agent.set_max_training_score_callback(max_training_score_callback)

def max_test_score_callback(episode):
    last_test_episode = agent.history.test_episodes[-1]
    print("New best test score.")
    fig = plot_quadcopter_episode(last_test_episode)
    display(fig)
agent.set_max_test_score_callback(max_test_score_callback)

def rolling_mean(x, N):
    # From https://stackoverflow.com/a/22621523/338676
    return np.convolve(x, np.ones((N,)) / N, mode='valid')

agent.train_n_episodes(num_episodes, eps=.05, act_random_first_n_episodes=50)
agent.train_n_episodes(500, eps=.02, act_random_first_n_episodes=50)
plot_quadcopter_episode(agent.history.test_episodes[-1])
Once you are satisfied with your performance, plot the episode rewards, either from a single run, or averaged over multiple runs.
plot_scores([ep.score for ep in agent.history.training_episodes], [ep.score for ep in agent.history.test_episodes])
plot_scores(rolling_mean([ep.score for ep in agent.history.training_episodes], 10),
            rolling_mean([ep.score for ep in agent.history.test_episodes], 10))
Question 1: Describe the task that you specified in task.py. How did you design the reward function?
Answer:
The reward function I defined in task.py was designed to reward the agent for being within 10m of the target position with increasing reward magnitude as it approaches the center of the target. The intention is that it might learn to maximize rewards by hovering at the target position until the episode ends.
if vert_dist < 10 and horiz_dist < 10:
    reward += 10 - vert_dist
    reward += .1 * (10 - horiz_dist)
The reward is split into two components, one for vertical distance from the goal and one for horizontal distance, with a factor of .1 applied to the horizontal component and 1 to the vertical component, since vertical distance is, somewhat subjectively, considered more important for this task.
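Putting the two components together, the reward computation above could read as follows (a standalone sketch; in `task.py` the same logic lives inside `get_reward()`, and the way the distances are derived from the pose here is an assumption):

```python
import numpy as np

def get_reward_sketch(pose, target_pos):
    """Reward grows as the quadcopter nears the target, with vertical
    distance weighted 10x more heavily than horizontal distance."""
    reward = 0.0
    vert_dist = abs(pose[2] - target_pos[2])
    horiz_dist = np.linalg.norm(pose[:2] - target_pos[:2])
    if vert_dist < 10 and horiz_dist < 10:
        reward += 10 - vert_dist
        reward += .1 * (10 - horiz_dist)
    return reward
```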
Additionally, some noise is added to the starting position at the beginning of each episode instead of starting at the same place every time. In Task.reset:
# Add some noise to the starting position
self.sim.pose[:3] += np.random.normal(0,3,3)
Similar to data augmentation in traditional supervised machine learning, this approach is intended to help the agent learn a policy that generalizes better to unvisited states. It also allows the agent to visit higher-value states early on just by having the "luck" of starting out at a higher position, which should help with training. As expected, this starting-position noise also increases the variability of the scores, since some episodes are inherently more difficult than others due to their starting positions.
Question 2: Discuss your agent briefly, using the following questions as a guide:
Answer: I chose to stick with the provided DDPG algorithm but added several features/options to enable experimentation with different neural network configurations and parameters, such as Batch Normalization, Leaky ReLUs, and input preprocessing. In the end, the following set of parameters seemed to work well for training the quadcopter on the task described above:
mu=0, theta=.3, sigma=1,
discount_factor=.9, lr_actor=.00001, lr_critic=.0001,
tau_actor=.4, tau_critic=.6,
activation_fn_actor='tanh',
activity_l2_reg=.003,
normalize_rewards=True,
Probably the most significant and sensitive parameter is activity_l2_reg which is an effective means of penalizing extreme action values. Combined with activation_fn_actor='tanh' this activity regularization incentivizes the agent to keep the controls near the center, which appears to be a helpful hint to the agent.
The discount factor (gamma) is set to .9 to encourage getting to high-value states quickly while still giving states farther in the future enough value to overcome the regularization losses applied by the activity L2 regularizer.
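As a quick sanity check on that choice, a reward arriving $k$ timesteps in the future is weighted by $\gamma^k$ in the return:

```python
gamma = 0.9
print(gamma ** 10)  # roughly 0.35: rewards 10 steps ahead still matter
print(gamma ** 50)  # roughly 0.005: the distant future is nearly ignored
```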
Additionally, this agent normalizes rewards during the training step. As suggested in this stackexchange answer quoting Andrej Karpathy, normalizing rewards is helpful because it controls the variance of the Critic by "encouraging and discouraging roughly half of the performed actions".
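A minimal sketch of this normalization step, assuming it is applied to each sampled batch of rewards (the function name and epsilon guard are illustrative):

```python
import numpy as np

def normalize_rewards(rewards, eps=1e-8):
    """Standardize a batch of rewards to zero mean and unit variance,
    so roughly half the sampled actions are encouraged and half
    discouraged."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```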
During training the agent takes uniform random actions for the first 50 episodes, a modification inspired by the start_steps parameter in OpenAI's implementation of DDPG. After these initial training episodes, the amplitude of exploration noise is regulated with the parameter eps, a factor applied to the Ornstein–Uhlenbeck noise. For the training run above it is set to .05 for the first 500 episodes and .02 for the next 500. These values were determined through hands-on experimentation to allow the agent to keep exploring without immediately crashing it by applying too much noise.
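The exploration scheme can be sketched as follows (a simplified Ornstein–Uhlenbeck process with the eps scaling described above; the parameter names mirror the ones listed earlier, but the class itself is illustrative, not the agent's actual implementation):

```python
import numpy as np

class OUNoiseSketch:
    """Mean-reverting Ornstein-Uhlenbeck noise, scaled by eps before
    being added to the actor's action."""
    def __init__(self, size, mu=0.0, theta=0.3, sigma=1.0):
        self.mu = mu * np.ones(size)
        self.theta = theta
        self.sigma = sigma
        self.reset()

    def reset(self):
        self.state = self.mu.copy()

    def sample(self, eps=1.0):
        # Pull back toward mu, plus Gaussian diffusion; scale by eps.
        dx = self.theta * (self.mu - self.state) \
             + self.sigma * np.random.standard_normal(len(self.state))
        self.state = self.state + dx
        return eps * self.state
```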
In MountainCarContinuous-v0.ipynb I show this agent learning OpenAI Gym's Mountain Car Continuous problem. The Mountain Car problem is well-suited for 2-dimensional visualization as it has a state size of 2 and action size of 1, so in that notebook I've created an animation showing how the Q-function and policy changes as this agent learns the task.
Question 3: Using the episode rewards plot, discuss how the agent learned over time.
Answer: In the training run above, the agent gradually learns over 1000 episodes, though the learning rate begins to taper off after about 600 episodes. After 1000 episodes, the mean score for the last 10 test episodes is around 300. There is significant variability in the scores, much of it due to the noise added to the starting position, but these results indicate that the agent still hasn't learned to fly reliably.
Question 4: Briefly summarize your experience working on this project. You can use the following prompts for ideas.
Answer:
This was a rather difficult task. The larger observation space and 4-dimensional action space made for an inherently more difficult problem than, for example, the Mountain Car task. Additionally, the quadcopter is extremely sensitive to noise in the action space, which can easily knock the agent out of the sky. Finding the right balance of exploration and exploitation proved challenging, and a satisfactory result is still elusive. While in some episodes the agent appears to make good corrective actions to maintain stable flight, it still hasn't learned to hover.
In most training runs during experimentation, the agent would end up stuck in a local optimum, for example slamming all or half of the motors to 900 for the whole episode, achieving a decent score but then remaining stuck in that suboptimal behavior. I didn't find an effective means of getting it unstuck from these policies; instead I tried to keep it from falling into those ruts in the first place by using an action regularizer (incentivizing keeping the controls near the center). This strategy has shown some success, but even training runs with an action regularizer often end up stuck with a suboptimal policy that always acts right in the middle.
This task was made more challenging by the requirement to define the reward function as well as the learning agent. Moving the goal posts while simultaneously designing the learning agent made it difficult to know whether I was making progress or just spinning my wheels. I found it fascinating that the DDPG agent faces this very same problem: training the actor while simultaneously moving the goal posts (the critic). To address this challenge, once I had defined a reward function that I thought should be good enough to incentivize the agent to learn to hover, I decided not to tinker with it much and instead focused on designing an agent that could learn from this reward signal.